Goto

Collaborating Authors

 Pelagonia Statistical Region


Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language

Krsteski, Stefan, Tashkovska, Matea, Sazdov, Borjan, Gjoreski, Hristijan, Gerazov, Branislav

arXiv.org Artificial Intelligence

The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10x larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at github.com/LVSTCK for source code, and at huggingface.co/LVSTCK for pretrained model weights and data.


Closing the Modality Gap for Mixed Modality Search

Li, Binxu, Zhang, Yuhui, Wang, Xiaohan, Liang, Weixin, Schmidt, Ludwig, Yeung-Levy, Serena

arXiv.org Artificial Intelligence

Mixed modality search -- retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents -- is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench -- the first benchmark specifically designed for mixed modality search -- GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.


Dialectal and Low-Resource Machine Translation for Aromanian

Jerpelea, Alexandru-Iulius, Rădoi, Alina, Nisioi, Sergiu

arXiv.org Artificial Intelligence

This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian - an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs, and (2) the development and comparative analysis of several machine translation models optimized for Aromanian. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation, complemented by a diacritics conversion system for different writing standards. This research brings contributions to both computational linguistics and language preservation efforts by establishing essential resources for a historically under-resourced language. All datasets, trained models, and associated tools are public: https://huggingface.co/aronlp and https://arotranslate.com


Artificial Intelligence in Cybersecurity: Building Resilient Cyber Diplomacy Frameworks

Stoltz, Michael

arXiv.org Artificial Intelligence

This paper explores how automation and artificial intelligence (AI) are transforming U.S. cyber diplomacy. Leveraging these technologies helps the U.S. manage the complexity and urgency of cyber diplomacy, improving decision-making, efficiency, and security. As global inter connectivity grows, cyber diplomacy, managing national interests in the digital space has become vital. The ability of AI and automation to quickly process vast data volumes enables timely responses to cyber threats and opportunities. This paper underscores the strategic integration of these tools to maintain U.S. competitive advantage and secure national interests. Automation enhances diplomatic communication and data processing, freeing diplomats to focus on strategic decisions. AI supports predictive analytics and real time decision making, offering critical insights and proactive measures during high stakes engagements. Case studies show AIs effectiveness in monitoring cyber activities and managing international cyber policy. Challenges such as ethical concerns, security vulnerabilities, and reliance on technology are also addressed, emphasizing human oversight and strong governance frameworks. Ensuring proper ethical guidelines and cybersecurity measures allows the U.S. to harness the benefits of automation and AI while mitigating risks. By adopting these technologies, U.S. cyber diplomacy can become more proactive and effective, navigating the evolving digital landscape with greater agility.


Hybrid Approach to Identify Druglikeness Leading Compounds against COVID-19 3CL Protease

Aqeel, Imra, Majid, Abdul

arXiv.org Artificial Intelligence

SARS-COV-2 is a positive single-strand RNA-based macromolecule that has caused the death of more than 6.3 million people since June 2022. Moreover, by disturbing global supply chains through lockdown, the virus has indirectly caused devastating damage to the global economy. It is vital to design and develop drugs for this virus and its various variants. In this paper, we developed an in-silico study-based hybrid framework to repurpose existing therapeutic agents in finding drug-like bioactive molecules that would cure Covid-19. We employed the Lipinski rules on the retrieved molecules from the ChEMBL database and found 133 drug-likeness bioactive molecules against SARS coronavirus 3CL Protease. Based on standard IC50, the dataset was divided into three classes active, inactive, and intermediate. Our comparative analysis demonstrated that the proposed Extra Tree Regressor (ETR) based QSAR model has improved prediction results related to the bioactivity of chemical compounds as compared to Gradient Boosting, XGBoost, Support Vector, Decision Tree, and Random Forest based regressor models. ADMET analysis is carried out to identify thirteen bioactive molecules with ChEMBL IDs 187460, 190743, 222234, 222628, 222735, 222769, 222840, 222893, 225515, 358279, 363535, 365134 and 426898. These molecules are highly suitable drug candidates for SARS-COV-2 3CL Protease. In the next step, the efficacy of bioactive molecules is computed in terms of binding affinity using molecular docking and then shortlisted six bioactive molecules with ChEMBL IDs 187460, 222769, 225515, 358279, 363535, and 365134. These molecules can be suitable drug candidates for SARS-COV-2. It is anticipated that the pharmacologist/drug manufacturer would further investigate these six molecules to find suitable drug candidates for SARS-COV-2. They can adopt these promising compounds for their downstream drug development stages.


Chi-square-based scoring function for categorization of MEDLINE citations

Kastrin, Andrej, Peterlin, Borut, Hristovski, Dimitar

arXiv.org Machine Learning

Objectives: Text categorization has been used in biomedical informatics for identifying documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood of MEDLINE citations containing genetic relevant topic. Methods: Our procedure requires construction of a genetic and a nongenetic domain document corpus. We used MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared frequencies of MeSH descriptors between two corpora applying chi-square test. A MeSH descriptor was considered to be a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest score given to those citations containing MeSH descriptors typical for the genetic domain. Results: Validation was done on a set of 734 manually annotated MEDLINE citations. It achieved predictive accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine learning algorithms (support vector machines, decision trees, na\"ive Bayes). Although the differences were not statistically significantly different, results showed that our chi-square scoring performs as good as compared machine learning algorithms. Conclusions: We suggest that the chi-square scoring is an effective solution to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for gene symbol disambiguation process.